Job Radar. Live notifications. AI processed.
upwork.com 2026-04-24 🟠
👤 Client: GBR, member since 2026-03-09
💰 Price: $500
🚩 Problem: Ensure high-quality data for a RAG system by cleaning and structuring reference documents.
📦 Existing: Not specified
Specifications:
[Target] Improve data quality for a RAG system by cleaning and structuring reference documents
[Method] One-off project focusing on document processing, normalization, and metadata tagging
[UI/UX] Not applicable
[Stack] Python, unstructured.io, LlamaParse, PyMuPDF, Docling, Pinecone, Weaviate, Qdrant
[Security] Not specified
[Format] JSON
Workflow:
1. Assess document types (PDFs, Word, HTML) and identify any OCR issues.
2. Develop a cleaning pipeline to remove headers/footers, fix broken text, handle multi-column layouts, and correct similar layout issues.
3. Implement structured chunking that respects the document hierarchy.
4. Extract tables while preserving their structure in markdown or JSON format.
5. Define and implement a metadata schema supporting source attribution down to section/page level.
6. Output clean, chunked, metadata-tagged data ready for vector database ingestion.
7. Review and improve the existing RAG setup focusing on embedding choice, retrieval quality, and tuning for sub-second latency.
8. Document findings and hand over to the developer via a short walkthrough call.
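
Sketch of steps 2, 3, 5 and 6, for illustration only. It assumes each document has already been parsed into one markdown-like string per page (for example by one of the parsers in the stack; no parser API is called here). The function names, the `min_share` threshold, and the metadata field names (`source`, `section_path`, `page_start`, `page_end`) are hypothetical and not taken from the listing.

```python
import json
import re
from collections import Counter

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")  # markdown-style headings


def _key(line):
    # Mask digits so "Page 1" and "Page 2" count as the same footer line.
    return re.sub(r"\d+", "0", line.strip())


def strip_repeated_lines(pages, min_share=0.6):
    """Drop lines recurring on many pages (likely headers/footers)."""
    counts = Counter()
    for page in pages:
        counts.update({_key(l) for l in page.splitlines() if l.strip()})
    threshold = max(2, int(len(pages) * min_share))
    repeated = {k for k, n in counts.items() if n >= threshold}
    return [
        "\n".join(
            l for l in page.splitlines()
            # Headings are always kept, even if they look repetitive.
            if HEADING.match(l) or _key(l) not in repeated
        )
        for page in pages
    ]


def chunk_by_heading(pages, source):
    """One chunk per section, tagged with heading trail and page range."""
    chunks, path, buf = [], [], []
    first_page = last_page = 1

    def flush():
        text = "\n".join(buf).strip()
        if text:
            chunks.append({
                "text": text,
                "metadata": {
                    "source": source,
                    "section_path": list(path),
                    "page_start": first_page,
                    "page_end": last_page,
                },
            })
        buf.clear()

    for page_no, page in enumerate(pages, start=1):
        for line in page.splitlines():
            m = HEADING.match(line)
            if m:
                flush()                    # close the previous section
                level = len(m.group(1))
                del path[level - 1:]       # pop deeper or sibling headings
                path.append(m.group(2).strip())
            elif line.strip():
                if not buf:
                    first_page = page_no
                last_page = page_no
                buf.append(line.strip())
    flush()
    return chunks


if __name__ == "__main__":
    sample = [
        "ACME Handbook\n# Policy\nIntro text.\nPage 1",
        "ACME Handbook\n## Scope\nApplies to all staff.\nPage 2",
        "ACME Handbook\nContinues here.\n## Exceptions\nNone.\nPage 3",
    ]
    cleaned = strip_repeated_lines(sample)
    print(json.dumps(chunk_by_heading(cleaned, "handbook.pdf"), indent=2))
```

Chunking on headings rather than fixed token windows keeps each chunk inside one section, so the section trail and page range can be stored as metadata for source attribution. Real documents would likely also need a size cap per chunk and separate handling for tables (step 4), which this sketch leaves out.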